SYSTEM AND METHOD FOR MUSIC IDENTIFICATION 

TECHNICAL FIELD 

The technical field is music systems and, in particular, the identification of music. 
BACKGROUND 

Current methods for identifying a song in a database are based on feature 
extraction and matching. U.S. Patent 5,918,223 discloses feature extraction techniques 
for content analysis in order to retrieve songs based on similarity. U.S. Patent 6,201,176 
similarly discloses feature extraction used for retrieving songs based on minimum feature 
distance. In another method, features, such as loudness, melody, pitch and tempo, may be 
extracted from a hummed song, for example, and decision rules are applied to retrieve
probable matches from a database of songs. However, it is difficult to derive reliable 
features from music samples. Additionally, feature matching is sensitive to the 
distortions of imperfect acquisition, such as improper humming, and also to noise in 
microphone-recorded music samples. Therefore, feature matching has not resulted in 
reliable searches from recorded samples. 

Other methods for identifying a song in a database do not involve processing 
audio data. For example, one method involves the use of a small appliance that is capable 
of recording the time of day. The appliance is activated when the user is interested in a 
song that is currently playing on the radio. The appliance is coupled to a computer 
system that is given access to a website operated by a service. The user transmits the 
recorded time to the website using the appliance and provides additional information 
related to location and the identity of the radio station which played the song. This 
information is received by the website together with play list timing information from the 
radio station identified. The recorded time is cross-referenced against the play list timing 
information. The name of the song and the artist are then provided to the user by the 
service through the website. Unfortunately, this method requires that the user remember 
the identity of the radio station that played the song when the appliance was activated. 
Additionally, the radio station must subscribe to the service and possess the supporting 
infrastructure necessary to participate in the service. Furthermore, the method is only 
effective for identifying music played on the radio, and not in other contexts, such as 
cinema presentations.

SUMMARY

A system and method for identifying music comprising recording a sample of
audio data and deriving a sample time signal from the audio data. A plurality of songs
represented by time signals is sorted and the sample time signal is matched with the time
signal of a song in the plurality of songs.

A system and method for identifying music comprising recording a sample of
audio data and deriving a sample time signal from the audio data. The sample time signal
is matched with a time signal of a plurality of time signals in a database, wherein each of
the plurality of time signals represents a song in the database.

A method for identifying music comprising recording a sample of audio data and
generating a first plurality of time signals from the sample of audio data, wherein the first
plurality of time signals are generated in distinct frequency bands. A second plurality of
time signals is generated from songs in a database, wherein the second plurality of time
signals are generated in the same distinct frequency bands as the first plurality of time
signals. The first plurality of time signals are matched with the second plurality of time
signals.

Other aspects and advantages will become apparent from the following detailed
description, taken in conjunction with the accompanying figures.

DESCRIPTION OF THE DRAWINGS

The detailed description will refer to the following drawings, wherein like
numerals refer to like elements, and wherein:

Figure 1 is a block diagram illustrating a first embodiment of a system for music
identification;

Figure 2 is a flow chart illustrating a first method for identifying music according
to the first embodiment;

Figure 3 is a diagram showing subplots demonstrating signal matching in a three
song database experiment; and

Figure 4 is a flow chart illustrating a second method for identifying music
according to the first embodiment.

DETAILED DESCRIPTION

Figure 1 is a block diagram 100 illustrating a first embodiment of a system for
music identification. A capture device 105 is used to record a sample of music, or audio
data, 102 from various devices capable of receiving and transmitting audio signals,
including, for example, radios, televisions and multimedia computers. Samples of music
may also be recorded from more direct sources, including, for example, cinema
presentations. The capture device 105 may include a microphone 110 and an A/D
converter 115. Additionally, the capture device 105 may also include an optional analog
storage medium 107 and an optional digital storage medium 116. The capture device 105
may be a custom made device. Alternatively, some or all components of the capture
device 105 may be implemented through the use of audio tape recorders, laptop or
handheld computers, cell phones, watches, cameras and MP3 players equipped with
microphones.

The sample of music 102 is recorded by the capture device 105 in the form of an
audio signal using the microphone 110. The A/D converter unit 115 converts the audio
signal of the recorded sample to a sample time signal 117. Alternatively, the audio signal
of the recorded sample may be stored in the optional analog storage medium 107. The
capture device 105 transmits the sample time signal 117 to a digital processing system,
such as a computer system 120. Alternatively, the sample time signal 117 may be stored
in the optional digital storage medium 116 for uploading to the computer system 120 at a
later time. The computer system 120 is capable of processing the sample time signal
117 into a compressed form to produce a processed sample time signal 121.
Alternatively, the sample time signal 117 may be processed by a separate processor unit
before being transmitted to the computer system 120. The computer system 120 is also
capable of accessing a remote database server 125 that includes a music database 130.
The computer system 120 may communicate with the database server 125 through a
network 122, such as, for example, the Internet, by conventional land-line or wireless
means. Additionally, the database server 125 may communicate with the computer
system 120. Alternatively, the database server 125 may reside in a local storage device of
computer system 120.

The music database 130 includes a plurality of songs, where each song may be
represented by a database entry 135. The database entry 135 for each song is comprised
of a processed time signal 140, a feature vector 145 and song information 150. The
processed time signal 140 for each song represents the entire song. The song information
150 may include, for example, song title, artist and performance. Additionally, the song
information 150 may also include price information and other related commercial
information.

The feature vector 145 for a song in the music database 130 is determined by
generating a spectrogram of the processed time signal 140 for the song and then
extracting features from the spectrogram. Various techniques related to discrete-time
signal processing are well known in the art for generating the spectrogram. Alternatively,
the feature vector 145 for a song may be extracted from the original, unprocessed time
signal for the song. The features are represented by numeric values, and loosely represent
specific perceptual musical characteristics, such as, for example, pitch, tempo and purity.
In a first embodiment, the feature vector 145 for each song in the database 130 includes
five feature components derived from the projection of a spectrogram in the time (X) and
frequency (Y) axes. The first feature is the Michelson contrast in the X direction, which
represents the level of "beat" contained in a song sample. The second feature represents
the amount of noise in the Y direction, or the "purity" of the spectrum. The third
feature is the entropy in the Y direction, which is calculated by first normalizing the Y
projection of the spectrogram to be a probability distribution and then computing the
Shannon entropy. The fourth and fifth features are the center of mass and the moment of
inertia, respectively, of the highest three spectral peaks in the Y projected spectrogram.
The fourth and fifth features roughly represent the tonal properties of a song sample.
Features representing other musical characteristics may also be used in the feature vectors
145.
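
By way of illustration only, four of the five features described above may be sketched
as follows. This is a minimal sketch under stated assumptions, not the implementation of
the embodiment: the spectrogram is assumed to be a list of rows indexed as
spec[frequency_bin][time_frame] with non-negative magnitudes, entropy is taken in bits,
a "peak" is taken to be a simple local maximum, and the moment of inertia is left
unnormalized. The function names are hypothetical. The second ("purity") feature is
omitted because the description does not define it precisely.

```python
import math

def projections(spec):
    """Project spec[freq_bin][time_frame] onto the time (X) and frequency (Y) axes."""
    x_proj = [sum(row[t] for row in spec) for t in range(len(spec[0]))]
    y_proj = [sum(row) for row in spec]
    return x_proj, y_proj

def michelson_contrast(values):
    """(max - min) / (max + min); applied to the X projection as a 'beat' measure."""
    hi, lo = max(values), min(values)
    return (hi - lo) / (hi + lo) if (hi + lo) else 0.0

def shannon_entropy(values):
    """Normalize to a probability distribution, then compute entropy in bits (assumed base 2)."""
    total = sum(values)
    if total == 0:
        return 0.0
    return -sum((v / total) * math.log2(v / total) for v in values if v > 0)

def peak_moments(y_proj, k=3):
    """Center of mass and (unnormalized) moment of inertia of the k highest local maxima."""
    peaks = [i for i in range(1, len(y_proj) - 1)
             if y_proj[i] > y_proj[i - 1] and y_proj[i] >= y_proj[i + 1]]
    top = sorted(peaks, key=lambda i: y_proj[i], reverse=True)[:k]
    mass = sum(y_proj[i] for i in top)
    if mass == 0:
        return 0.0, 0.0
    center = sum(i * y_proj[i] for i in top) / mass
    inertia = sum(y_proj[i] * (i - center) ** 2 for i in top)
    return center, inertia
```

The same functions would be applied both to each song in the database and to the
recorded sample, so that the resulting vectors are directly comparable.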




In a first method for identifying music according to the first embodiment,
described in detail below, the sample of music 102 is converted into the sample time
signal 117 and transmitted to the computer system 120. The computer system 120
processes the sample time signal 117 to produce a processed sample time signal 121. The
computer system 120 applies a signal matching technique with respect to the processed
sample time signal 121 and the processed time signals 140 of the music database 130 to
select a song corresponding to the best match. The song information 150 corresponding
to the selected song is presented to the user.




Figure 2 is a flowchart 200 illustrating a first method for identifying music
according to the first embodiment. In step 205 the sample of music 102 is recorded by
the capture device 105 and converted into the sample time signal 117. The sample of
music 102 may be recorded, for example, at 44.1 kHz for approximately eight seconds.
However, it is understood that one of ordinary skill in the art may vary the frequency and
time specifications in recording samples of music.




In step 210 the sample time signal 117 is transmitted to the computer system 120
and is processed by the computer system 120 to generate a processed sample time signal
121. The processed sample time signal 121 may be generated by converting the sample
time signal 117 from stereo to mono and filtering the sample time signal 117 using a
zero-phase FIR filter with pass-band edges at 400 and 800 Hz and stop-band edges at 200
and 1000 Hz. The filter's lower stop-band excludes potential 50 or 60 Hz power line
interference. The upper stop-band is used to exclude aliasing errors when the sample
time signal 117 is subsequently subsampled by a factor of 21. The resulting processed
sample time signal 121 may be companded using a quantizer response that is halfway
between linear and A-law in order to compensate for soft volume portions of music. The
processed sample time signal 121 may be companded as described in pages 142-145 in
DIGITAL CODING OF WAVEFORMS, Jayant and Noll, incorporated herein by reference.
Other techniques related to digital coding of waveforms are well known in the art and
may be used in the processing of processed sample time signal 121. Additionally, it is
understood that one of ordinary skill in the art may vary the processing specifications as
desired in converting the sample time signal 117 into a more convenient and useable
form.
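
For purposes of illustration, the stereo-to-mono, subsampling and companding operations
of step 210 may be sketched as follows. This is a minimal sketch under stated
assumptions: samples are floats in the range -1 to 1, "halfway between linear and A-law"
is interpreted as the arithmetic mean of the linear and A-law responses, the A-law
parameter is the conventional value 87.6, and the output is an 8-bit signed code. None of
these specifics is given in the description, and the function names are hypothetical. The
zero-phase FIR band-pass filter is deliberately omitted; it would be applied between
to_mono and subsample.

```python
import math

A = 87.6  # conventional A-law parameter (assumed; not specified in the description)

def to_mono(stereo):
    """Average left/right pairs into a single channel."""
    return [(left + right) / 2.0 for left, right in stereo]

def subsample(signal, factor=21):
    """Keep every factor-th sample; the signal must already be band-limited."""
    return signal[::factor]

def a_law(x):
    """Standard A-law compressor characteristic for x in [-1, 1]."""
    ax = abs(x)
    if ax < 1.0 / A:
        y = A * ax / (1.0 + math.log(A))
    else:
        y = (1.0 + math.log(A * ax)) / (1.0 + math.log(A))
    return math.copysign(y, x)

def compand_8bit(x):
    """Quantize with a response halfway between linear and A-law (assumed: their mean)."""
    x = max(-1.0, min(1.0, x))
    y = 0.5 * (x + a_law(x))
    return int(round(127 * y))
```

With these assumptions a soft sample such as 0.01 is mapped to a noticeably larger code
than a purely linear 8-bit quantizer would assign, which is the stated purpose of the
companding.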

Similar processing specifications are used to generate the processed time signals
140 in the music database 130. The storage requirements for the processed time signals
140 are reduced by a factor of 84 compared to their original uncompressed size. The
details of the filters and the processing of processed sample time signal 121 may differ
from that of processed time signals 140 in order to compensate for microphone frequency
response characteristics.
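
One way to account for the factor of 84 is shown below. It assumes that the original
recordings use 16-bit samples, which the description does not state; the stereo-to-mono
conversion, the subsampling factor of 21 and the 8-bit processed samples are taken from
the description.

```latex
\underbrace{2}_{\text{stereo to mono}}
\times \underbrace{21}_{\text{subsampling}}
\times \underbrace{\tfrac{16}{8}}_{\text{16-bit to 8-bit (assumed)}}
= 84
```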

In step 215 a signal match intensity is computed using a cross-correlation between
the processed sample time signal 121 and each processed time signal 140 in the music
database 130. A normalized cross-correlation is interpreted to be the cosine of the angle
between the recorded processed sample time signal 121, u, and portions, v_i, of the
processed time signals 140 of database entries 135 in the music database 130:

    \cos(\theta) = \frac{u^{T} v_i}{\|u\| \, \|v_i\|}    (1)

Standard cross-correlation may be implemented using FFT overlap-save
convolutions. The normalized cross-correlation in Equation 1 may also be implemented
with the aid of FFT overlap-save convolution. The normalization for ||u|| is precomputed.
The normalization for ||v_i|| is computed with the aid of the following recursion for
intermediate variable, s_i:

    s_{i+1} = s_i + e_{i+n}^2 - e_i^2, \quad s_i = \sum_{j=i}^{i+n-1} e_j^2 = \|v_i\|^2    (2)

where v_i = (e_i, ..., e_{i+n-1}) is a 16384 dimensional portion of the processed time signals
140 in the music database 130 for the song that is being matched. The pole on the unit
circle in the recursion of Equation 2 causes floating point calculations to accumulate
errors. Exact calculation, however, is possible using 32 bit integer arithmetic, since the
inputs are 8 bit quantities and 32 bits is sufficiently large to store the largest possible
result for n = 16384. During step 215, the maximum absolute value of the normalized
cross-correlation is stored to be used later in step 220.
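
For purposes of illustration, the signal match intensity of step 215 may be sketched as
follows. This is a minimal sketch under stated assumptions: it computes each inner product
directly rather than with FFT overlap-save convolution, so it is only practical for short
signals, and it assumes integer-valued samples so that the running sum of squares is
exact. The running sum is updated by adding the square of the sample entering the window
and subtracting the square of the sample leaving it. The function name and return
convention are hypothetical.

```python
import math

def match_intensity(sample, song):
    """Return (max |normalized cross-correlation|, offset) of sample against song.

    sample and song are sequences of integers (e.g. 8-bit codes); len(song) >= len(sample).
    """
    n = len(sample)
    norm_u = math.sqrt(sum(x * x for x in sample))   # precomputed once for the sample
    s = sum(e * e for e in song[:n])                 # sum of squares of the first window
    best, best_offset = 0.0, 0
    for i in range(len(song) - n + 1):
        if i > 0:
            # Running update: add the entering sample, drop the leaving one.
            # Python integers are exact; with 8-bit inputs and n = 16384 the
            # sum stays below 2**31, so 32-bit integers would also suffice.
            s += song[i + n - 1] ** 2 - song[i - 1] ** 2
        if s == 0 or norm_u == 0:
            continue                                  # silent window: cosine undefined
        dot = sum(a * b for a, b in zip(sample, song[i:i + n]))
        intensity = abs(dot) / (norm_u * math.sqrt(s))
        if intensity > best:
            best, best_offset = intensity, i
    return best, best_offset
```

Because the measure is a cosine, a sample that is a scaled copy of a portion of the song
still yields an intensity of approximately one at the matching offset.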

In step 220 the song with the maximum absolute value of the normalized cross- 
correlation is selected. The song information 150 for the selected song, including title, 
artist and performance, is presented to a user in step 225. 

The effectiveness of the signal match technique described in step 215 is 
illustrated in Figure 3, which shows subplots demonstrating signal matching in a three
song database experiment. The subplots show the absolute value of normalized cross- 
correlation between a processed time signal obtained from a recorded sample of music 
and the processed time signals for the three songs in the database. An eight second 
portion of the first song, SONG 1, was played through speakers and sampled to produce a
processed time signal. The method described in Figure 2 was applied to generate a 
normalized cross-correlation for each of the three songs in the database. The large peak 
near the center of the first subplot demonstrates that the signal match intensity is greatest 
for SONG 1. No peaks exist in the subplots for SONG 2 or SONG 3 because the 
processed time signal was taken from SONG 1. In addition, the correlation values for the 
other parts of SONG 1 are also quite low. The low values are likely due to the long 
samples used (eight seconds), so that in the signal representation there is enough random 
variation in the song performance to make the match unique. The results of Figure 3 
show that a correctly matching song can be easily recognized. 

In a second method for identifying music according to the first embodiment,
described in detail below, the sample of music 102 is converted into the sample time
signal 117 and transmitted to the computer system 120. The computer system 120
processes the sample time signal 117 to produce a processed sample time signal 121 and
extracts features from the processed sample time signal 121 to generate a sample feature
vector. Alternatively, the sample feature vector may be extracted directly from the
sample time signal 117. As described above, the feature vectors 145 for the songs in the
music database 130 are generated at the time each song is added to the music database
130. The database entries 135 in the music database 130 are sorted in ascending order
based on feature space distance with respect to the sample feature vector. The computer
system 120 applies a signal matching technique with respect to the processed sample time
signal 121 and the processed time signals 140 of the sorted music database 130,
beginning with the first processed time signal 140. If a signal match waveform satisfies a
decision rule, described in more detail below, the song corresponding to the matched
processed time signal 140 is played for a user. If the user verifies that the song is correct,
the song information 150 corresponding to the matched processed time signal 140 is
presented to the user. If the user indicates that the song is incorrect, further signal
matching is performed with respect to the processed sample time signal 121 and the
remaining processed time signals 140 in the sorted order.




Figure 4 is a flow chart 400 illustrating a second method for identifying music
according to the first embodiment. The details involved in steps 405 and 410 are similar
to those involved in steps 205 and 210 of the flowchart 200 shown in Figure 2.


In step 415 a sample feature vector for the processed sample time signal 121 is
generated as described above with respect to the feature vectors 145 of the songs in the
music database 130. The features extracted from the processed sample time signal 121
are the same features extracted for the songs in the music database 130. Each feature
vector 145 may be generated, for example, at the time the corresponding song is added to
the music database 130. Alternatively, the feature vectors 145 may be generated at the
same time that the sample feature vector is generated.




In step 420 the distance between the sample feature vector and the database
feature vectors 145 for all of the songs in the music database 130 is computed. Feature
distance may be computed using techniques known in the art and further described in
U.S. Patent 6,201,176, incorporated herein by reference. In step 425 the database entries
135 are sorted in an ascending order based on feature space distance with respect to the
sample feature vector. It should be clear to those skilled in the art that steps 420 and 425
may be replaced with implicit data structures, and that an explicit sort of the entire music
database 130 is not necessary.
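
For purposes of illustration, one way of avoiding an explicit sort is sketched below. This
is a minimal sketch under stated assumptions: feature space distance is taken to be
Euclidean distance, which is a stand-in for the techniques of the referenced patent, and
the implicit data structure is a binary heap from which candidates are drawn only as
needed. The entry layout (feature_vector, title) and the function name are hypothetical.

```python
import heapq
import math

def nearest_first(sample_vector, entries):
    """Lazily yield database entries in ascending feature space distance.

    entries is a list of (feature_vector, title) pairs. Building the heap is
    linear in the number of entries; each candidate drawn costs O(log N), so
    a search that stops early never pays for a full sort.
    """
    heap = [(math.dist(sample_vector, vec), idx)
            for idx, (vec, _title) in enumerate(entries)]
    heapq.heapify(heap)
    while heap:
        _distance, idx = heapq.heappop(heap)
        yield entries[idx]
```

A caller would iterate over nearest_first(...) and stop as soon as a match is confirmed,
leaving the remaining entries unordered.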




In step 430 a first (or next) song in the sorted list is selected and a signal match
waveform is computed in step 435 for the processed time signal 140 corresponding to the
selected song in relation to the processed sample time signal 121. The specifications
involved in computing the signal match waveform in step 435 are similar to those
described above for computing the signal match intensity in step 215 of flowchart 200.
However, in step 435 the entire waveform is used in the subsequent processing of step
440, described in detail below, instead of using only the signal match intensity value.

In step 440 a decision rule is applied to determine whether the current song is to
be played for the user. Factors that may be considered in the decision rule include, for
example, the signal match intensity for the current song in relation to the signal match
intensities for the other songs in the music database 130 and the number of false songs
already presented to the user. In Figure 3, the peak in the signal matching subplot for
SONG 1 is clearly visible. The peak represents a match between a sample of music and a
song in a database. The decision rule identifies the occurrence of such a peak in the
presence of noise. Additionally, in order to limit the number of false alarms (i.e., wrong
songs presented to the user) the decision rule may track the number of false alarms shown
and may limit the false alarms by adaptively modifying itself.

In one implementation of the decision rule the signal match waveform computed
in step 435 includes a signal cross-correlation output. The absolute value of the
cross-correlation is sampled over a predetermined number of positions along the output. An
overall absolute maximum of the cross-correlation is computed for the entire song. The
overall absolute maximum is compared to the average of the cross-correlations at the
sampled positions along the signal cross-correlation output. If the overall absolute
maximum is greater than the average cross-correlation by a predetermined factor, then the
current song is played for the user.
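
For purposes of illustration, the peak-to-average test described above may be sketched as
follows. This is a minimal sketch under stated assumptions: the sampled positions are
taken at evenly spaced indices, and the default number of positions (8) and factor (5.0)
are arbitrary placeholders, since the description says only that both are predetermined.
The function name is hypothetical, and the input is assumed to be non-empty.

```python
def peak_exceeds_average(xcorr, num_positions=8, factor=5.0):
    """Return True if the overall peak of |xcorr| exceeds factor times the
    average of |xcorr| at evenly spaced sample positions."""
    magnitudes = [abs(c) for c in xcorr]
    peak = max(magnitudes)                            # overall absolute maximum
    step = max(1, len(magnitudes) // num_positions)   # spacing of sampled positions
    sampled = magnitudes[::step][:num_positions]
    average = sum(sampled) / len(sampled)
    return peak > factor * average
```

A waveform with one sharp peak over a low background passes the test, while a waveform
with no prominent peak, such as those described for SONG 2 and SONG 3 in Figure 3, would
be expected to fail it.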

In another implementation of the decision rule, the current song is played for the
user only if the overall absolute maximum is larger by a predetermined factor than the
average cross-correlation and no false alarms have been presented to the user. If the user
has already been presented with a false alarm, then the decision rule stores the maximum
cross-correlation for each processed time signal 140 in the music database 130. The
decision rule presents the user with the song corresponding to the processed time signal
140 with the maximum cross-correlation. This implementation of the decision rule limits
the number of false songs presented to the user.

Another implementation of the decision rule may use a threshold to compare
maximum cross-correlation for the processed time signals 140 for the songs in the music
database 130 in relation to the processed sample time signal 121. It is understood that
variations based on statistical decision theory may be incorporated into the
implementations of the decision rule described above.

If the decision rule is satisfied, the current song is played for the user in step 445.
In step 450 the user confirms whether the song played matches the sample of music
recorded earlier. If the user confirms a correct match, the song information 150 for the
played song is presented to the user in step 455 and the search ends successfully. If the
decision rule is not satisfied in step 440, the next song in the sorted list is retrieved in step
430 and steps 430-440 are repeated until a likely match is found, or the last song in the
sorted list is retrieved in step 460. Similarly, if the user does not confirm a match in step
450, steps 430-450 are repeated for the songs in the sorted list until the user confirms a
correct match in step 450, or the last song in the sorted list is retrieved in step 460.

The features extracted in step 415 and the feature vectors 145 for the songs in the
music database 130 are used to sort the order in which the signal matching occurs in step
435. The feature-ordered search, together with the decision rule in step 440 and the
"human-in-the-loop" confirmation of step 450, results in the computationally expensive
signal matching step 435 being applied to fewer songs in order to find the correct song.
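
For purposes of illustration, the loop formed by steps 430 through 460 may be sketched as
follows. This is a minimal sketch under stated assumptions: the three callables stand in
for the signal matching of step 435, the decision rule of step 440 and the user
confirmation of step 450, and their names and signatures are hypothetical. The count of
matches computed is returned only to make the saving in step 435 visible.

```python
def feature_ordered_search(candidates, match_waveform, decision_rule, user_confirms):
    """Walk candidates in feature-distance order until the user confirms a song.

    Returns (confirmed_entry_or_None, number_of_signal_matches_computed).
    """
    matches_computed = 0
    for entry in candidates:                    # step 430: first (or next) song
        waveform = match_waveform(entry)        # step 435: expensive signal match
        matches_computed += 1
        if not decision_rule(waveform):         # step 440: skip unlikely songs
            continue
        if user_confirms(entry):                # steps 445-450: play and confirm
            return entry, matches_computed      # step 455: present song information
    return None, matches_computed               # step 460: list exhausted
```

If the correct song lies near the front of the feature-ordered list, the loop returns
after only a few signal matches rather than one per database entry.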

In another embodiment, a plurality of processed time signals in distinct frequency
bands may be generated from the recorded sample of music 102. In addition, a plurality
of processed time signals in the same frequency bands may be generated from the
database entries 135. The signals in the individual bands may be matched with each other
using normalized cross-correlation or some other signal matching technique. In this case,
a decision rule based, for example, on majority logic can be used to determine signal
strength. A potential advantage of this embodiment may be further resistance to noise or
signal distortions.
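
For purposes of illustration, a majority logic rule over per-band match intensities may
be sketched as follows. This is a minimal sketch under stated assumptions: each band is
taken to vote for a match when its intensity reaches a threshold, and a strict majority
of bands is required. The threshold value and the function name are hypothetical and are
not taken from the description.

```python
def majority_match(band_intensities, threshold=0.5):
    """Return True if more than half of the frequency bands report a match."""
    votes = sum(1 for intensity in band_intensities if intensity >= threshold)
    return 2 * votes > len(band_intensities)
```

Under this rule a single band corrupted by noise or distortion cannot by itself cause or
prevent a match when three or more bands are used.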

In another embodiment, multiple feature vectors may be generated for one or
more songs in the music database 130. The multiple feature vectors are generated from
various segments in a song. Separate entries are added to the music database 130 for each
feature vector thus generated. The music database 130 is then sorted in an ascending
order based on feature space distance between a sample feature vector taken from a
sample of music and the respective feature vectors for the entries. Although this may
increase the size of the music database 130, it may reduce search times for songs having
multiple segments with each segment possessing distinct features.

While the present invention has been described in connection with an exemplary
embodiment, it will be understood that many modifications will be readily apparent to
those skilled in the art, and this application is intended to cover any variations thereof.



HP 10014315-1 






